Abstract: A general approach for anomaly detection or novelty 
detection consists in estimating high density regions or Minimum 
Volume (MV) sets. The One-Class Support Vector Machine 
(OCSVM) is a state-of-the-art algorithm for estimating such 
regions from high dimensional data. Yet it suffers from practical 
limitations. When applied to a limited number of samples it 
can lead to poor performance even when picking the best 
hyperparameters. Moreover the solution of OCSVM is very 
sensitive to the selection of hyperparameters which makes it 
hard to optimize in an unsupervised setting. We present a 
new approach to estimate MV sets using the OCSVM with a 
different choice of the parameter controlling the proportion of 
outliers. The solution function of the OCSVM is learnt on a 
training set and the desired probability mass is obtained by 
adjusting the offset on a test set to prevent overfitting. Models 
learnt on different train/test splits are then aggregated to reduce 
the variance induced by such random splits. Our approach 
makes it possible to tune the hyperparameters automatically and 
obtain nested set estimates. Experimental results show that our 
approach outperforms the standard OCSVM formulation while 
suffering less from the curse of dimensionality than kernel density 
estimates. Results on actual data sets are also presented. 

I. Introduction 

An anomaly is defined as any observation that does not 
conform to the expected normal behavior [1]. The goal of 
anomaly detection, also referred to as novelty detection, is to 
identify abnormal observations without previously knowing 
them. Applications include machine fault detection, network 
intrusion detection in cybersecurity or fraud detection in 
finance. Given observations $X_1, \dots, X_n \in \mathbb{R}^d$, $d \geq 1$, independent 
and identically distributed realizations of an unknown 
probability distribution $P$, we would like to learn a subset of 
$\mathbb{R}^d$ such that points lying inside this set will be considered 
as normal and points lying outside will be considered as 
anomalies. The implicit hypothesis made in this context is 
that anomalies correspond to rare events and are located in the 
tail of the distribution. A possible approach is to estimate the 
subset corresponding to the region where the data are most 
concentrated. Such a region is called a Minimum Volume 
(MV) set, i.e., the set of minimum volume with probability 
mass at least $\alpha$, with $\alpha$ close to 1. 

The notion of MV sets was introduced by Polonik [2]. 
Let $\mu$ be the Lebesgue measure and $\alpha \in (0,1)$. A MV set 
with mass at least $\alpha$ is a solution of the following optimization 
problem 

$$\min_{G \in \mathcal{B}(\mathbb{R}^d)} \mu(G) \quad \text{such that} \quad P(G) \geq \alpha , \tag{1}$$

where $\mathcal{B}(\mathbb{R}^d)$ is the set of all measurable subsets of $\mathbb{R}^d$. 

We assume that the probability measure $P$ has a density 
$h$ with respect to the Lebesgue measure $\mu$ and that $h$ has no 
flat parts, i.e., $\mu(\{x, h(x) = \tau\}) = 0$ for all $\tau > 0$. One can 
show that under regularity assumptions on $h$, the optimization 
problem (1) has a unique solution $G^*$ (up to subsets of null 
$\mu$-measure). This solution satisfies $P(G^*) = \alpha$ and is a density 
level set, i.e., a set of the form $\{h \geq \tau\}$, $\tau > 0$ ID. A MV 
set is thus a density level set. The converse holds with no 
assumption on the density: density level sets are MV sets. 

There are essentially two different approaches to estimate a 
MV set. The first one is to resort to a plug-in approach where 
one first estimates the underlying density and then thresholds 
it at the level $\tau_\alpha$ such that $P(\{h_n \geq \tau_\alpha\}) = \alpha$, where $h_n$ is 
a density estimator. The main drawback of this approach is 
that plug-in estimators do not scale well with the dimension 
(for e.g. see Sj-lbl). Moreover the entire density is estimated 
while just a density level set is needed. 
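
The plug-in approach can be illustrated in one dimension. The sketch below is not the implementation used in the experiments: it assumes a Gaussian kernel density estimator with a hand-picked bandwidth, and it replaces the unknown probability $P$ by the empirical measure of the sample. The names `gaussian_kde` and `plug_in_threshold` are hypothetical.

```python
import math
import random


def gaussian_kde(sample, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built on `sample`."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm

    return density


def plug_in_threshold(density, sample, alpha):
    """Largest level tau such that at least a fraction alpha of `sample` has density >= tau."""
    values = sorted((density(x) for x in sample), reverse=True)
    k = math.ceil(alpha * len(values))  # number of sample points that must be kept
    return values[k - 1]


random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(500)]
h_n = gaussian_kde(sample, bandwidth=0.3)  # bandwidth chosen by hand (assumption)
tau = plug_in_threshold(h_n, sample, alpha=0.95)

# Empirical mass of the estimated level set {h_n >= tau}.
inside = sum(h_n(x) >= tau for x in sample) / len(sample)
```

The whole density has to be evaluated before a single level set is extracted, which is the inefficiency pointed out above.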

The second one is to resort to a direct approach by choosing 
the set of minimum volume containing a proportion a of the 
sample points among a class of sets such as Glivenko-Cantelli 
or Vapnik-Cervonenkis classes. Direct approach algorithms 
include algorithms from I?), IS) and the OCSVM 13, ifTOl . 
Scott and Nowak 0 introduce a framework analogous to 
the empirical risk minimization in binary classification to 
estimate a MV set. Davenport et al. ja use a Neyman- 
Pearson classification approach to estimate MV sets with 
SVMs or any other classification algorithms. Tax and Duin 
[10] introduce the Support Vector Data Description (SVDD) 
algorithm to search for the hypersphere with the minimum 
volume containing at least a proportion $\alpha$ of the data sample in 
a Reproducing Kernel Hilbert Space (RKHS). If the kernel 
used is the Gaussian kernel then the OCSVM and SVDD are 
equivalent im. 

While the problem of anomaly detection is unsupervised, it 
is known that an unsupervised problem can be transformed 
into a supervised one ca. Steinwart et al. ina introduce 
a classification framework for density level set estimation. 
The classification is performed between the original data and 
an artificial second class. The density level set $\{h \geq \tau_\alpha\}$ 
can then be learnt with any classification algorithm without 
estimating the entire density $h$. However one still needs to 
choose the threshold corresponding to a mass $\alpha$, which can be 
computationally expensive. 

Fig. 1. Application of the OCSVM with $\nu = 0.4$ on a Gaussian mixture 
sample of size n = 1000. In blue the estimated set, in black the level sets of 
the solution function of the OCSVM, in red the support vectors. The solution 
function captures the structure of the tail of the Gaussian mixture distribution. 

The OCSVM algorithm introduced by Schölkopf et al. [9] 
is one of the most popular algorithms for anomaly and novelty 
detection. In [14], Vert and Vert show that the OCSVM is 
a consistent estimator of density level sets. In fact they give 
a more powerful result: the solution function returned by the 
OCSVM gives an estimate of the tail of the underlying density 
h. The OCSVM is mainly applied with the Gaussian kernel 
and the performance highly depends on the kernel bandwidth 
selection. 

With the formulation introduced by Schölkopf et al. [9], 
the mass of the estimated set is controlled by a parameter 
$\nu$ specified by the user. The estimated set is guaranteed to 
contain at least a fraction $1 - \nu$ of the data. However simple 
simulations show that the OCSVM can perform very poorly 
when estimating a MV set from a finite data sample. For instance, 
for a Gaussian mixture such as the one in Figure 1, no value 
of the kernel bandwidth gives a good approximation of the 
true MV set with mass at least 0.95 when the parameter $\nu$ is 
chosen such that the empirical probability of the estimated set 
is larger than $\alpha$ (see Section III-A). However, using a different 
value of $\nu$, the set estimated by the OCSVM clearly differs 
from the true MV set but the solution function captures the 
structure of the tail of the underlying distribution, as shown in 
Figure 1. 

The approach we propose and describe in the second part 
of this paper consists in fixing $\nu$ at a value such that the 
proportion of points outside the estimated set will be strictly 
greater than $1 - \alpha$. The solution function is learnt on a training 
set and then thresholded to obtain the desired probability mass 
on a test set to prevent overfitting. To reduce the variance 


induced by the random split of the data set into a training 
set and a test set we aggregate several models. Thresholding 
the solution function of the OCSVM to obtain the desired 
probability mass is an approach that has already been very 
briefly mentioned in [9] and in [15]. However, to the best of 
our knowledge, such an approach has never been considered 
thoroughly. In the second part of this paper we present the 
OCSVM and its properties before presenting our approach. In 
the last part we compare the performance of our approach with 
the OCSVM on simulated data sets and apply our approach to 
real data sets. Connections can be made between this paper and 
[16], in which Filippone et al. apply the possibilistic c-means 
algorithm in kernel-induced spaces. 


II. Method 


A. Background on One-Class SVM 

The OCSVM was introduced by Schölkopf et al. [9] to estimate 
high density regions from a data sample. After mapping 
the data into a feature space through a function $\Phi$ determined by 
a specific kernel $k$, the OCSVM finds a separating hyperplane 
between the origin and the mapped data. The separating 
hyperplane defined by a vector $w$ and an offset $\rho$ is given 
by the solution of the following optimization problem 


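$$\min_{w, \xi, \rho} \ \frac{1}{2} \|w\|^2 - \rho + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \langle w, \Phi(x_i) \rangle \geq \rho - \xi_i , \quad \xi_i \geq 0 , \quad 1 \leq i \leq n \tag{2}$$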


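where $\nu \in (0,1)$ is a parameter specified by the user. This 
problem is convex and, as strong duality holds, it is solved 
through its dual 

$$\min_{\gamma} \ \frac{1}{2} \sum_{i,j=1}^{n} \gamma_i \gamma_j k(x_i, x_j) \quad \text{s.t.} \quad 0 \leq \gamma_i \leq \frac{1}{\nu n} , \ 1 \leq i \leq n , \quad \sum_{i=1}^{n} \gamma_i = 1 . \tag{3}$$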

The resulting solution function is given by 

$$x \mapsto \sum_{i=1}^{n} \gamma_i k(x, x_i)$$

and the resulting estimated MV set by 

$$G = \Big\{ x, \ \sum_{i=1}^{n} \gamma_i k(x, x_i) - \rho_\nu \geq 0 \Big\} \tag{4}$$

where $\rho_\nu$ denotes the $\rho$ solution of (2). 

As with SVM in supervised settings, not all the $\gamma_i$ are 
non-zero. The points $x_i$ such that $\gamma_i > 0$ are called support vectors 
(SVs). Support vectors are exactly the samples located outside 
or on the border of the set $G$: 

$$\Big\{ x_j, \ 1 \leq j \leq n, \ \sum_{i=1}^{n} \gamma_i k(x_j, x_i) - \rho_\nu \leq 0 \Big\} .$$

Outliers are exactly the samples that are located strictly 
outside the set $G$: 

$$\Big\{ x_j, \ 1 \leq j \leq n, \ \sum_{i=1}^{n} \gamma_i k(x_j, x_i) - \rho_\nu < 0 \Big\} .$$

The parameter $\nu$ needs to be chosen by the user. We have 
the following property. 

Proposition 1: Assuming the solution of (2) satisfies $\rho_\nu > 0$, 
the following statements hold: 

i) $\nu$ is an upper bound on the fraction of outliers and a 
   lower bound on the fraction of SVs: 

   $$\frac{\#\text{Outliers}}{n} \leq \nu \leq \frac{\#\text{SV}}{n} .$$

ii) If the data were generated independently from a distribution 
    $P$ absolutely continuous with respect to the Lebesgue 
    measure and if the kernel $k$ is analytic and non-constant, 
    then 

    $$\frac{\#\text{SV}}{n} \to \nu \ \text{almost surely} , \qquad \frac{\#\text{Outliers}}{n} \to \nu \ \text{almost surely} .$$

This property is of great interest in practice. It gives the 
user some insight into how to choose the parameter $\nu$. Indeed 
the empirical probability of the estimated set is greater than 
$1 - \nu$ and the probability of the estimated set converges almost 
surely to $1 - \nu$ as $n$ tends to infinity. Hence one possible 
approach is to choose $\nu = 1 - \alpha$ to estimate a MV set with 
mass at least a oa, d. 

In the following the kernel $k$ is the Gaussian kernel $k_\sigma$, $\sigma > 0$, 
defined as 

$$k_\sigma(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right) .$$

We denote by $f_\sigma$ the solution function 

$$f_\sigma(x) = \sum_{i=1}^{n} \gamma_i k_\sigma(x, x_i) .$$

The paper of Vert and Vert [14] proves the consistency of the 
OCSVM for density level set estimation and hence for MV 
set estimation. The optimization problem associated with the 
OCSVM studied in their paper is the following 

$$\min_{f \in \mathcal{H}_\sigma} \ \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - f(X_i)) + \lambda \|f\|^2_{\mathcal{H}_\sigma} \tag{5}$$

where $\mathcal{H}_\sigma$ is the RKHS associated to the normalized Gaussian 
kernel and $\lambda > 0$ a regularization parameter. 

Vert and Vert [14] prove that for a well calibrated kernel 
bandwidth $\sigma$, the OCSVM is a consistent estimator of every 
density level set of level $\tau \in (0, 2\lambda)$. To show such a 
result they prove that the solution of the OCSVM, when a 
normalized Gaussian kernel is used, converges in norm and 
in probability to the underlying density truncated at $2\lambda$: 

$$\lim_{n \to +\infty} \|f_\sigma - h_\lambda\|_{L_2} = 0 \ \text{in probability}$$

where 

$$h_\lambda(x) = \begin{cases} \dfrac{h(x)}{2\lambda} & \text{if } h(x) \leq 2\lambda , \\ 1 & \text{otherwise.} \end{cases}$$


Remark 1 (Connection with kernel smoothing): If $\nu = 1$, 
the constraints of the dual problem (3) give $\gamma_i = \frac{1}{n}$ for all 
$i \in \{1, \dots, n\}$. This means that all the samples are taken into 
account in the solution and the solution function is 

$$f_\sigma(x) = \frac{1}{n} \sum_{i=1}^{n} k_\sigma(x, X_i) .$$

This function is the one we recover when performing 
kernel smoothing with the same kernel bandwidth $\sigma$ in all 
the directions. 
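
The $\nu = 1$ case of Remark 1 is simple enough to write down directly. The sketch below assumes the unnormalized Gaussian kernel $\exp(-\|x - x'\|^2 / (2\sigma^2))$; the helper names `gaussian_kernel` and `kernel_smoother` and the three sample points are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Unnormalized Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)) (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_smoother(sample, sigma):
    """Solution function for nu = 1: every point is a support vector with weight 1/n."""
    n = len(sample)

    def f(x):
        return sum(gaussian_kernel(x, xi, sigma) for xi in sample) / n

    return f


sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # toy data, for illustration only
f = kernel_smoother(sample, sigma=1.0)
value_at_origin = f((0.0, 0.0))
value_far_away = f((10.0, 10.0))
```

Every evaluation of `f` visits all $n$ samples, whereas an OCSVM solution with $\nu < 1$ only visits its support vectors.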

The advantage of the OCSVM over a kernel smoothing is 
that the estimated set is only characterized by the support 
vectors which, for small values of $\nu$, represent a small fraction 
of the sample size: the solution is sparse. This property is 
useful when performing the prediction task which is therefore 
less expensive than when using a kernel smoothing approach. 
Besides the solution function gives an approximation of the 
tail of the underlying density and, unlike a kernel smoothing, 
the approximation given by the solution function can be very 
bad elsewhere. This is why classification is sometimes said 
to be easier than regression HD: we only want to be good in 
a neighborhood of the border of the set of interest and not 
elsewhere. 


Finally, parametrizing the mass of the MV set 
estimated by the OCSVM via the parameter $\nu$ does not yield 
nested set estimates as the mass $\alpha$ increases. For 
each $\nu$ a new optimization problem is solved and nothing 
ensures that the different set estimates are nested. Variants 
of the OCSVM that ensure this property have been introduced 
EO), mi- With our approach, the mass of the MV set is 
parametrized through the offset and this allows us to produce 
nested sets in a neighborhood of the estimated MV set with 
mass at least $\alpha$. For the same solution function, we select 
different offsets $\rho$, one for each mass. 

B. Automatic Calibration of OCSVM 

We want to estimate a MV set with mass at least $\alpha$, with $\alpha$ 
close to 1, from the sample $X_1, \dots, X_n$. Thanks to the result of 
Vert and Vert [14], we know that the solution function of the 
OCSVM gives an approximation of the tail of the underlying 
distribution. More precisely, in our approach we use the fact 
that $f_\sigma$ is an approximation of the underlying density in a 
neighborhood of the border of the MV set. The algorithm we 
propose is described in Figure 2 and detailed hereafter. 

First the data set $X = (X_1, \dots, X_n)$ is randomly split into 
a training set $X_{train}$ and a test set $X_{test}$, respectively of size 
$n_{train}$ and $n_{test}$. Let $G$ be the set estimated by the OCSVM 
on the training set. The parameter $\nu$ is chosen such that we 
are able to estimate the underlying distribution for the interval 
of masses $[\alpha - c, \alpha + c]$ where $c > 0$. Therefore $\nu$ must be 
chosen such that $P_{n_{train}}(G) \leq \alpha - c$, where $P_{n_{train}}$ denotes 




Input: parameter $\nu$, mass $\alpha$, data set $X$, set of kernel bandwidths 
$\Sigma$, $c > 0$ 
Randomly split $X$ into a training set $X_{train}$ and a test set $X_{test}$ 
for kernel bandwidth $\sigma$ in $\Sigma$ do 
    $f_\sigma$ = OCSVM($\nu$, $\sigma$, $X_{train}$) 
    for $\beta$ in $[\alpha - c, \alpha + c]$ do 
        Bisection search to find $\rho_\beta$ such that 
        $P_{n_{test}}(G_\beta) = \beta$, where $G_\beta = \{x, f_\sigma(x) - \rho_\beta \geq 0\}$ 
        Computation of $\mu(G_\beta)$ by Monte Carlo integration 
    end for 
end for 
Compute the area under the Mass Volume curve $(\beta, \mu(G_\beta))$ for 
each $\sigma$: AMV($\sigma$) 
$\sigma_{opt} = \operatorname{argmin}_{\sigma \in \Sigma}$ AMV($\sigma$) 
return $G_\alpha = \{x, f_{\sigma_{opt}}(x) - \rho_\alpha \geq 0\}$ 

Fig. 2. Algorithm of the OCSVM with a calibrated offset and the selection 
of the optimal kernel bandwidth 

the empirical probability measure based on the training set. 
$P_{n_{train}}(G) \leq \alpha - c$ is equivalent to a fraction of outliers, 
points lying outside $G$, greater than $1 - (\alpha - c)$. What we 
have from Proposition 1 is that the fraction of outliers is less 
than $\nu$ for all $n$ and converges almost surely to $\nu$ as $n$ tends 
to infinity. The closer $\nu$ is to 1, the more outliers we allow 
the OCSVM to find. If $\nu$ has been set such that the fraction 
of outliers is less than $1 - (\alpha - c)$, then a higher value should 
be chosen. As we only consider values of $\alpha$ close to 1, we 
do not need $\nu$ to be too close to 1 and can therefore preserve 
the sparsity of the OCSVM. In our algorithm we assume that 
a good value for $\nu$ is known and is set independently of the 
data set. 

The function $f_\sigma$ gives an approximation of the tail of the 
distribution. Consequently, thresholding it at $\rho_\alpha$ such that 
$P_{n_{test}}(f_\sigma \geq \rho_\alpha) = \alpha$ should offer an approximation of 
the MV set with mass at least $\alpha$, where $P_{n_{test}}$ denotes the 
empirical probability measure based on the test set. 
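
The offset calibration can be sketched as a bisection search on the scores of the test points. This is an illustration under stated assumptions rather than the authors' code: `test_scores` stands for hypothetical values of $f_\sigma$ on the test set, and because an empirical mass only takes values in multiples of $1/n_{test}$, the search returns (up to a tolerance) the largest offset whose empirical mass is at least $\alpha$.

```python
def empirical_mass(scores, rho):
    """Fraction of scores s with s >= rho, i.e. the empirical mass of {f >= rho}."""
    return sum(s >= rho for s in scores) / len(scores)


def calibrate_offset(scores, alpha, tol=1e-9):
    """Bisection search for the largest offset whose set keeps an empirical mass >= alpha."""
    lo, hi = min(scores), max(scores)
    if empirical_mass(scores, hi) >= alpha:
        return hi
    # Invariant: mass(lo) >= alpha > mass(hi); the mass is non-increasing in rho.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if empirical_mass(scores, mid) >= alpha:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical values of the solution function on ten test points.
test_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
rho = calibrate_offset(test_scores, alpha=0.8)
mass = empirical_mass(test_scores, rho)
```

Here eight of the ten scores must stay inside the set, so the search converges to the third smallest score.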

Remark 2: Let $\alpha_1 < \dots < \alpha_N$ be $N$ values in $[\alpha - c, \alpha + c]$ 
and let $\rho_1 > \dots > \rho_N$ be such that for all $i \in \{1, \dots, N\}$ 
we have $P_{n_{test}}(f_\sigma \geq \rho_i) = \alpha_i$. Let $G_i$ be the set $G_i = 
\{x, f_\sigma(x) \geq \rho_i\}$, then by construction the following holds 

$$G_1 \subset \dots \subset G_N .$$
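
The nesting in Remark 2 follows from thresholding a single score function at decreasing offsets. The sketch below checks this numerically under illustrative assumptions: `score` is a hypothetical stand-in for $f_\sigma$, the offsets are taken as order statistics of the test scores instead of being found by bisection, and set inclusion is verified on a finite grid.

```python
import math
import random


def score(x):
    """Hypothetical stand-in for the solution function f_sigma (1-D)."""
    return math.exp(-0.5 * x * x)


def offset_for_mass(test_scores, mass):
    """Largest offset keeping at least a fraction `mass` of the test scores."""
    ordered = sorted(test_scores, reverse=True)
    k = math.ceil(mass * len(ordered))
    return ordered[k - 1]


random.seed(1)
test_points = [random.gauss(0.0, 1.0) for _ in range(200)]
test_scores = [score(x) for x in test_points]

masses = [0.91, 0.93, 0.95, 0.97, 0.99]
offsets = [offset_for_mass(test_scores, m) for m in masses]

# Evaluate each estimated set {x, score(x) >= rho} on a grid and check inclusion.
grid = [-4.0 + 0.01 * i for i in range(801)]
sets = [{x for x in grid if score(x) >= rho} for rho in offsets]
nested = all(small <= large for small, large in zip(sets, sets[1:]))
```

Since the score function is shared, a larger mass can only lower the offset and therefore only enlarge the set.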

C. Performance metric and kernel bandwidth selection 

To assess the performance of our approach and select the 
kernel bandwidth we need a performance metric. The kernel 
bandwidth parameter selection is an important task in practice 
as the solution of the OCSVM highly depends on its choice. Low 
values of $\sigma$ lead to overfitting. On the contrary, high values of 
$\sigma$ lead to underfitting. 


A performance metric used for the theoretical study of MV 
set or density level set estimators is the Lebesgue measure 
of the symmetric difference between the true MV set $G^*$ and 
the estimate $G$, $\mu(G^* \Delta G)$, where $A \Delta B = (A \setminus B) \cup (B \setminus A)$ 

Q, o, Ea. 

This performance metric depends on the true MV set G*. 
We use it to assess the performance of our approach and select 
the optimal kernel bandwidth when we have access to the true 
MV set. 

Several performance metrics have been used to assess the 
quality of one-class classification algorithms and select the 
optimal hyperparameters (see among others il, EOl, ED, 
[23]). It is worth noting that all these metrics require 
sampling points uniformly, either to compute the volume 
of the estimated set or to generate an artificial second class. 
Therefore both methods suffer from the curse of dimensionality. 
First, the proportion of points uniformly sampled in the 
hypercube enclosing the data that lie in the estimated set can 
decrease exponentially to 0 with the dimension. Second, in 
high dimensions, data are expected to be very sparse and 
very easily separated, leading classification solutions to overfit. 
We must therefore limit the use of these metrics to data sets 
of low dimension, e.g. $d < 10$. This has been mentioned 
by Tax in HIl, E3- 
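
The exponential decrease of the hit rate can be made concrete with a closed-form case. The sketch below takes the unit ball as an illustrative stand-in for the estimated set (an assumption, not a set from the experiments) and computes the fraction of its enclosing hypercube $[-1, 1]^d$ that it occupies, which is the probability that a uniformly sampled point falls inside it.

```python
import math


def ball_to_cube_ratio(d):
    """Volume of the unit-radius ball divided by the volume of the cube [-1, 1]^d."""
    ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return ball_volume / 2 ** d


ratios = {d: ball_to_cube_ratio(d) for d in (1, 2, 3, 5, 10)}
for d, r in ratios.items():
    print(f"d = {d:2d}: fraction of uniform samples inside the ball = {r:.5f}")
```

In dimension 10 fewer than three uniform samples in a thousand land inside the ball, so most of the sampling budget is wasted.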

The performance metric we decide to use in our algorithm 
to select the kernel bandwidth is the Mass Volume curve 
introduced by Clémençon and Jakubowicz [24] and defined 
as $\{(\alpha, \mu(G^*_\alpha)), \alpha \in (0,1)\}$. To use this performance metric, 
we still need to sample points uniformly to compute the 
volume. The Mass Volume curve is a functional criterion 
that can be used to assess the quality of a scoring rule 
in the unsupervised setting. The Mass Volume curve of the 
true underlying distribution is the lowest Mass Volume curve 
that can be obtained. Clémençon and Robbiano [25] give 
the explicit relation between the well known area under the 
ROC curve (AUC) and the area under the Mass Volume 
curve. Minimizing the area under the Mass Volume curve is 
equivalent to maximizing the AUC when the second class has 
been generated from a uniform distribution. 

The Mass Volume curve is suited to assess the quality of 
scoring rules whereas the first purpose of the OCSVM is not 
to estimate a scoring rule. Indeed, the OCSVM with $\nu = 1 - \alpha$ 
gives an estimated set of the form $\{x, f_\sigma(x) \geq \rho_\nu\}$. 
However there is no guarantee that for all $\rho \neq \rho_\nu$, sets of the 
form $\{x, f_\sigma(x) \geq \rho\}$ are good approximations of MV sets. 
Our approach estimates a scoring rule for the points located 
in the tail of the distribution and we use the area under the 
Mass Volume curve for masses in a neighborhood of a as a 
performance metric to select the best kernel bandwidth. 

To compute the Mass Volume curve, $\{(P(G_\beta), \mu(G_\beta)), \beta \in 
[\alpha - c, \alpha + c]\}$, we need to compute the probability and the 
volume of the estimated set. The probability is estimated on 
the test set and is thus equal to $\beta$ as we choose the offset 
such that the empirical probability of the estimated set on the 
test set equals $\beta$. We estimate the volume by Monte Carlo 
estimation. 
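
Once the pairs (mass, volume) are available, the selection criterion reduces to a one-dimensional integral. The sketch below assumes a trapezoidal rule, which the text does not specify, and uses made-up volumes for three hypothetical bandwidths; `area_under_mv_curve` and `select_bandwidth` are illustrative names.

```python
def area_under_mv_curve(masses, volumes):
    """Trapezoidal approximation of the area under the (mass, volume) curve."""
    pairs = sorted(zip(masses, volumes))
    area = 0.0
    for (m0, v0), (m1, v1) in zip(pairs, pairs[1:]):
        area += 0.5 * (v0 + v1) * (m1 - m0)
    return area


def select_bandwidth(curves):
    """`curves` maps a bandwidth to (masses, volumes); return the one with the smallest area."""
    return min(curves, key=lambda sigma: area_under_mv_curve(*curves[sigma]))


masses = [0.91, 0.95, 0.99]
# Made-up volumes for three hypothetical bandwidths.
curves = {
    0.1: (masses, [10.0, 12.0, 16.0]),
    0.5: (masses, [6.0, 7.0, 9.0]),
    2.0: (masses, [8.0, 9.0, 13.0]),
}
area_small_sigma = area_under_mv_curve(*curves[0.1])
sigma_opt = select_bandwidth(curves)
```

A lower curve means smaller sets for the same mass, so the bandwidth with the smallest area is retained.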


Volume computation: The volume of a set $G = \{x, f_\sigma(x) \geq \rho\}$ 
is defined as 

$$\mu(G) = \int \mathbb{1}_G(x) \, \mu(dx) . \tag{6}$$

This integral cannot be computed exactly so we resort to 
Monte Carlo estimation. As we do not know how to sample 
uniformly in the set $G$ either, we resort to importance sampling, 
rewriting (6) as 

$$\mu(G) = \int \frac{\mathbb{1}_G(x)}{q(x)} q(x) \, \mu(dx) \tag{7}$$

where $q$ must be a well chosen distribution. 

The most popular distribution used in the literature is the 
uniform distribution over the hypercube $G_c$ enclosing the data. 
Let $V_c$ be the volume of $G_c$, then the density of such a 
distribution is $q_c(x) = \mathbb{1}_{G_c}(x) / V_c$ and 

$$\mu(G) = V_c \int \frac{\mathbb{1}_G(x)}{\mathbb{1}_{G_c}(x)} q_c(x) \, \mu(dx) = V_c \int \mathbb{1}_G(x) q_c(x) \, \mu(dx) = V_c \, \mathbb{E}_{q_c}[\mathbb{1}_G(Z)] .$$

Thanks to the Law of Large Numbers the volume $\mu(G)$ is 
estimated by 

$$\mu_c(G) = \frac{V_c}{m} \sum_{i=1}^{m} \mathbb{1}_G(Z_i) , \quad Z_i \sim q_c .$$

Sampling uniform data is an issue worth mentioning as it 
is the factor limiting the estimation of Minimum Volume sets 
in a high dimension setting. 
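
The estimator above takes only a few lines to implement. In the sketch below the set $G$ is the unit disk, chosen as an assumption because its volume $\pi$ is known, and the enclosing hypercube is $[-1, 1]^2$; `monte_carlo_volume` is a hypothetical helper name.

```python
import random


def monte_carlo_volume(indicator, lower, upper, m, rng):
    """Estimate the volume of {x : indicator(x)} from m uniform draws in an axis-aligned box."""
    box_volume = 1.0
    for lo, hi in zip(lower, upper):
        box_volume *= hi - lo
    hits = 0
    for _ in range(m):
        z = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        if indicator(z):
            hits += 1
    # V_c times the fraction of uniform samples that fall in the set.
    return box_volume * hits / m


def in_unit_disk(z):
    return z[0] ** 2 + z[1] ** 2 <= 1.0


rng = random.Random(0)
volume = monte_carlo_volume(in_unit_disk, [-1.0, -1.0], [1.0, 1.0], 10000, rng)
```

With 10000 draws the standard error is about 0.016 here; in higher dimension the hit rate drops and the same budget gives a far noisier estimate.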

D. Aggregation 

In Section II-B we presented our approach, which consists of the 
following steps: 

1) Randomly split the data set into a training set and a test 
set 

2) Train the OCSVM on the training set to obtain $f_\sigma$ 

3) Find the offset $\rho_\alpha$ such that $P_{n_{test}}(\{f_\sigma \geq \rho_\alpha\}) = \alpha$ on 
the test set 

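Randomly splitting the data set into training and test sets 
introduces variance in the result. To reduce the variance we 
aggregate several models based on $B$ train/test splits. Let 
$(f_\sigma^b, \rho_\alpha^b)$, $1 \leq b \leq B$, be the models obtained, where $\rho_\alpha^b$ is 
such that 

$$P_{n_{test}}(\{f_\sigma^b - \rho_\alpha^b \geq 0\}) = \alpha .$$

Averaging all the models we obtain 

$$F^B_\alpha(x) = \frac{1}{B} \sum_{b=1}^{B} \left( f_\sigma^b(x) - \rho_\alpha^b \right) .$$

The final estimated set is given by 

$$G^B_\alpha = \{x, F^B_\alpha(x) \geq 0\} .$$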
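
The aggregation step amounts to averaging the calibrated decision functions. In the sketch below the three (score function, offset) pairs are hypothetical stand-ins for models learnt on different splits; in the method itself each pair would come from an OCSVM fit followed by the offset calibration on the corresponding test set.

```python
def aggregate(models):
    """Average the calibrated functions x -> f_b(x) - rho_b over all B models."""
    n_models = len(models)

    def aggregated(x):
        return sum(f(x) - rho for f, rho in models) / n_models

    return aggregated


# Hypothetical (score function, calibrated offset) pairs from three train/test splits.
models = [
    (lambda x: -abs(x - 0.1), -1.0),
    (lambda x: -abs(x), -1.2),
    (lambda x: -abs(x + 0.1), -0.8),
]
F = aggregate(models)

inside_at_zero = F(0.0) >= 0   # x = 0 belongs to the aggregated set
inside_at_three = F(3.0) >= 0  # x = 3 does not
```

The final set is the non-negative region of the averaged function, so no extra calibration is applied after averaging.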


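Input: parameter $\nu$, mass $\alpha$, data set $X$, set of kernel bandwidths 
$\Sigma$, $c > 0$, number of models $B$ 
for $b$ in $\{1, \dots, B\}$ do 
    Randomly split $X$ into a training set $X_{train}$ and a test set $X_{test}$ 
    for kernel bandwidth $\sigma$ in $\Sigma$ do 
        $f_\sigma^b$ = OCSVM($\nu$, $\sigma$, $X_{train}$) 
        for $\beta$ in $[\alpha - c, \alpha + c]$ do 
            Bisection search to find $\rho_\beta^b$ such that 
            $P_{n_{test}}(G_\beta^b) = \beta$, where $G_\beta^b = \{x, f_\sigma^b(x) - \rho_\beta^b \geq 0\}$ 
        end for 
    end for 
end for 
For all $\beta$ and all $\sigma$, compute the volume of the set 
$\{x, F^B_\beta(x) \geq 0\}$ where $F^B_\beta(x) = \frac{1}{B} \sum_{b=1}^{B} (f_\sigma^b(x) - \rho_\beta^b)$ 
Compute the area under the Mass Volume curve $(\beta, \mu(\{x, F^B_\beta(x) \geq 0\}))$ for 
each $\sigma$: AMV($\sigma$) 
$\sigma_{opt} = \operatorname{argmin}_{\sigma \in \Sigma}$ AMV($\sigma$) 
return $G^B_\alpha = \{x, F^B_\alpha(x) \geq 0\}$ where 
$F^B_\alpha(x) = \frac{1}{B} \sum_{b=1}^{B} (f_{\sigma_{opt}}^b(x) - \rho_\alpha^b)$ 

Fig. 3. Aggregation of the models learnt on different train/test splits 

The algorithm is described in Figure 3. 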

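Proposition 2 (Nested sets): Considering several values 
$0 < \alpha_1 < \dots < \alpha_N < 1$, we can construct nested sets 

$$G^B_{\alpha_1} \subset \dots \subset G^B_{\alpha_N} .$$

Proof: For $i \in \{1, \dots, N\}$, let $(f_\sigma^{b,i}, \rho_i^b)$, $1 \leq b \leq B$, be 
the models obtained on the sequence of training and test sets 
for the mass $\alpha_i$. We have for all $i, j \in \{1, \dots, N\}$ 

$$f_\sigma^{b,i} = f_\sigma^{b,j} = f_\sigma^b$$

as $f_\sigma^b$ only depends on the train and test split. By construction 
we also have $\rho_1^b \geq \dots \geq \rho_N^b$ for all $b$. Then for all $b$, 

$$f_\sigma^b - \rho_1^b \leq \dots \leq f_\sigma^b - \rho_N^b .$$

By summing over $b$ 

$$F^B_{\alpha_1} \leq \dots \leq F^B_{\alpha_N}$$

and if $G^B_{\alpha_i} = \{x, F^B_{\alpha_i}(x) \geq 0\}$ then 

$$G^B_{\alpha_1} \subset \dots \subset G^B_{\alpha_N} .$$

■ 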

III. Experiments 

For all the experiments we choose $\nu = 1 - \alpha$ for the 
OCSVM and $\nu = 0.4$ for our approach. With our approach 
80% of the data set is used as the training set and the other 
20% as the test set. Unless stated otherwise, a = 0.95, the 
Mass Volume curves are made from 10 masses equally spaced 
between 0.91 and 0.99 and we uniformly sample 10000 points 
in the smallest hypercube enclosing the data to compute the 
volumes. All the experimental work was done with Scikit-learn 
[26] using the underlying LIBSVM library [27]. 






A. Simulation with bimodal distribution 


We sample $n = 1000$ points from a two-dimensional Gaussian mixture 
of density $h(x) = \frac{1}{2} \mathcal{N}((2.5, 2.5), I)(x) + \frac{1}{2} \mathcal{N}((7.5, 7.5), I)(x)$, where $I$ 
denotes the identity matrix and $\mathcal{N}(m, \Sigma)(x)$ the density of 
the Gaussian distribution with mean $m$ and covariance $\Sigma$. 
We want to estimate the MV set with mass at least 0.95 
from this sample. Knowing the density, we only need the 
level $\tau_\alpha$ such that $P(h(X) \geq \tau_\alpha) = \alpha$ to know the true 
MV set $G^*$. $\tau_\alpha$ is the $1 - \alpha$ quantile of the distribution of 
$h(X)$. We estimate such a quantile with 1 million points 
generated from $h$. To compute the volume of the symmetric 
difference between the estimated set and the true MV set 
we sample points uniformly in the hypercube enclosing the 
data. Our approach is implemented with an aggregation of 10 
models. The comparison of the performance as a function of 
$\sigma$ between the OCSVM and our approach is shown in Figure 
4. We observe that the performance of the OCSVM obtained 
for the best value of $\sigma$, i.e., the value of $\sigma$ minimizing this 
performance metric, is worse than the performance reached for a 
wide range of values of $\sigma$ with our approach. We represent the 
sets obtained for the values of $\sigma$ giving the best performance 
for each approach in Figures 5 and 6. The solution obtained 
with our approach is clearly better. Besides, even the solution 
obtained for the best $\sigma$ of the OCSVM tends to overfit (Figure 
5). 
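
The computation of the true level $\tau_\alpha$ can be sketched as follows. The code assumes equal mixture weights of 1/2 and uses 100000 draws instead of the 1 million mentioned above to keep the run short; the function names are hypothetical.

```python
import math
import random

MEANS = [(2.5, 2.5), (7.5, 7.5)]


def density(x):
    """Equal-weight mixture of two 2-D Gaussians with identity covariance."""
    total = 0.0
    for m in MEANS:
        sq_dist = (x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2
        total += 0.5 * math.exp(-0.5 * sq_dist) / (2.0 * math.pi)
    return total


def draw(rng):
    """Draw one point from the mixture: pick a component, then sample it."""
    m = rng.choice(MEANS)
    return (rng.gauss(m[0], 1.0), rng.gauss(m[1], 1.0))


def level_for_mass(alpha, n_draws, rng):
    """Empirical (1 - alpha) quantile of h(X) with X drawn from h."""
    values = sorted(density(draw(rng)) for _ in range(n_draws))
    return values[int((1.0 - alpha) * n_draws)]


rng = random.Random(0)
tau = level_for_mass(0.95, 100000, rng)

# Sanity check on fresh draws: the level set {h >= tau} should have mass close to 0.95.
fresh = [density(draw(rng)) for _ in range(20000)]
mass = sum(v >= tau for v in fresh) / len(fresh)
```

The true MV set is then $\{x, h(x) \geq \tau_\alpha\}$, against which any estimate can be compared through the symmetric difference.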



Fig. 4. Performance as a function of $\sigma$: OCSVM (dashed line) and our 
approach with an aggregation of 10 models (solid line) 

In Figure |7] we show the evolution of the measure of the 
symmetric difference between the true and the estimated MV 
set with mass at least 0.95 as a function of the number of 
samples. The results are averaged over 100 repetitions. For 
each sample size, the best $\sigma$ is computed by minimization 
of the area under the Mass Volume curve for our approach 
and through minimization of the measure of the symmetric 
difference for the OCSVM. Again, in the case of the OCSVM 
the ground truth is assumed to be known for parameter 
tuning, while our approach automatically tunes $\sigma$ without 
knowledge of the ground truth. Despite this, our approach 
outperforms the OCSVM when we consider the measure of 



Fig. 5. In dashed line the true MV set. In solid line the estimated MV set 
for the best $\sigma$ of the OCSVM with respect to the measure of the symmetric 
difference between the true and the estimated MV set shown in Figure 4. 



Fig. 6. In dashed line the true MV set. In solid line the estimated MV set 
for the best $\sigma$ of our approach with respect to the measure of the symmetric 
difference between the true and the estimated MV set shown in Figure 4. 

the symmetric difference between the true and the estimated 
MV set. The approach with aggregation further 
improves on the performance obtained without aggregation. 

B. Bimodal distribution with outliers 

We now consider a two-dimensional Gaussian mixture 
sample to which we add 5% outliers uniformly distributed over 
a hypercube enclosing the data. We thus sample $n = 1000$ 
points from the distribution with density 
$h(x) = 0.475 \cdot \mathcal{N}((2.5, 2.5), I)(x) + 0.475 \cdot \mathcal{N}((7.5, 7.5), I)(x) + \frac{0.05}{V_C} \mathbb{1}_C(x)$, 
where $C = [-2, 12] \times [-2, 12]$ and $V_C$ is the volume of 
$C$. Knowing the density, we proceed as in Section III-A to 
compute the true MV set $G^*$ of such a distribution. For the 
OCSVM we choose the value of $\sigma$ minimizing the measure 














Fig. 7. Performance as a function of the number of samples n. OCSVM 
(dashed line), our approach without aggregation (solid line) and our approach 
with an aggregation of 3 models (dotted line). 

of the symmetric difference between the estimated set and the 
true MV set. The estimated set is shown in Figure 8. Our 
approach is implemented with an aggregation of 10 models. 
We consider 20 values of $\sigma$ equally spaced between 0.01 and 
3. The best $\sigma$ is obtained by minimization of the area under 
the Mass Volume curve. The estimated set is shown in Figure 
9. This experiment suggests that our approach is more robust 
to outliers than the OCSVM. 



Fig. 8. In dashed line the true MV set. In solid line the estimated MV set. 
Outliers are represented by crosses. 

C. Comparison with plug-in approach 

In this section we compare the performance of the plug-in 
approach with our approach with respect to the number 
of features $d$ for a Gaussian mixture. We recall here that 
the plug-in approach consists in estimating the underlying 
density and then thresholding it at the level $\tau_\alpha$ such that 



Fig. 9. In dashed line the true MV set. In solid line the estimated MV set. 
Outliers are represented by crosses. 

$P(\{h_n \geq \tau_\alpha\}) = \alpha$. The performance metric used to compare 
both approaches is the measure of the symmetric difference 
between the true and the estimated MV set with mass at least 
0.95. We generate a Gaussian mixture sample of size $n = 500$ 
with density $h(x) = \frac{1}{2} \mathcal{N}(2.5 \cdot \mathbb{1}_d, I_d)(x) + \frac{1}{2} \mathcal{N}(7.5 \cdot \mathbb{1}_d, I_d)(x)$, 
$\mathbb{1}_d$ denoting the vector of $\mathbb{R}^d$ with all its components equal 
to 1 and $I_d$ denoting the identity matrix of dimension $d$. 
For the plug-in approach we use a kernel density estimator 
$h_n$ to estimate $h$ and a bisection search to estimate $\tau_\alpha$. The 
kernel used is the Gaussian kernel with the same bandwidth $s$ 
in all the directions. The bandwidth $s$ is selected through 
a 4-fold cross validation among 15 values equally spaced 
between 0.1 and 10. Then we threshold $h_n$ at $\hat{\tau}_\alpha$ such that 
$P_n(h_n \geq \hat{\tau}_\alpha) = \alpha$, where $P_n$ is the empirical probability 
measure based on the sample of size $n$. Our approach is 
performed with an aggregation of 5 models and the kernel 
bandwidth is automatically selected through minimization of 
the area under the Mass Volume curve. In Figure 10 we 
show the evolution of the performance for both approaches. 
The results are averaged over 100 repetitions. Even though 
the performance of both approaches is quite similar for $d = 2$ 
and $d = 3$, for $d > 3$ we observe that the performance of 
the plug-in approach deteriorates much faster than the 
performance of our approach. We limit this experiment to 
$d = 8$ because of the difficulty of computing volumes in high 
dimension. 

D. Two moons data set 

We generate a two-dimensional two moons data set of size 
n = 2000 and try to estimate a MV set with mass at least 
0.95. We choose 30 values of $\sigma$ equally spaced between 0.01 
and 0.5. We average 25 models based on 25 train/test random 
splits of the data set. The best $\sigma$ obtained by minimization 
of the area under the Mass Volume curve is $\sigma = 0.15$ (see 
Figure 11). The estimated set is represented in Figure 12. Its 
empirical mass on the whole data set is 0.96. 
















E. Real data set 



Fig. 10. Performance as a function of the number of features $d$. Plug-in 
approach (dashed line) and our approach with an aggregation of 5 models 
(solid line). Our approach clearly outperforms the plug-in approach as soon 
as dimension $d$ increases. 



Fig. 11. Area under the mass volume curve as a function of $\sigma$ for the two 
moons data set. The minimum is reached at $\sigma = 0.15$. 


We consider here the Boston housing data set [28] from 
the UCI machine learning repository. This data set concerns 
housing values in suburbs of Boston and consists in n = 506 
samples and d = 14 features which can be either categorical, 
integer or real. We only consider two of the features for a 
better representation of our approach: the average number of 
rooms per dwelling and the percentage lower status of the 
population. We first standardize the features, i.e., component 
wise center and scale to unit variance, and then apply our 
approach to estimate MV sets. We choose 30 values of $\sigma$ 
equally spaced between 0.01 and 4. We average 25 models 
based on 25 train/test random splits of the data set. The best 
$\sigma$ obtained by minimization of the area under the Mass Volume 
curve is $\sigma = 0.42$ (see Figure 13). The estimated sets are 
represented in Figure 14. The estimated MV set with mass at 
least 0.90 has an empirical mass of 0.91 on the whole data 
set and the estimated MV set with mass at least 0.95 has an 
empirical mass of 0.95. We observe that the estimated sets are 
nested. 
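
The standardization step applied before fitting can be sketched as below. The rows are made-up values, not taken from the Boston housing data set, and the code divides by the population standard deviation (dividing by $n$), which is an assumption since the text does not say which variance estimate is used.

```python
import math


def standardize(rows):
    """Center each column and scale it to unit (population) variance."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


# Made-up (rooms per dwelling, lower status percentage) pairs, for illustration only.
rows = [[6.5, 5.0], [6.0, 9.0], [7.0, 4.0], [5.5, 14.0]]
z = standardize(rows)

column_means = [sum(r[j] for r in z) / len(z) for j in range(2)]
column_vars = [sum(r[j] ** 2 for r in z) / len(z) for j in range(2)]
```

Putting both features on the same scale matters here because a single bandwidth $\sigma$ is used in all directions.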



Fig. 12. Estimated MV set with mass at least 0.95 for a generated two 
moons data set 

Fig. 13. Area under the mass volume curve as a function of $\sigma$ for the Boston 
housing data set. The minimum is reached at $\sigma = 0.42$. 


IV. Conclusion 

This paper presents a new approach to estimate MV sets 
using the OCSVM algorithm. Results show that it outperforms 
the standard way to use the OCSVM. Our approach is based on 
the calibration of the offset of the solution function to obtain 
the desired probability mass on a test set. It makes it possible to compute 
nested set estimates without adding any condition 
ensuring this property or considering several regularization 
parameters. Moreover it provides a scoring rule for samples 
located in the tail of the underlying distribution. The computed 
Mass Volume curve makes it possible to assess the performance of the 
approach and to select the kernel bandwidth automatically. 
Our solution inherits the sparsity of the OCSVM which is a 
computational advantage over kernel smoothing. 

The kernel bandwidth selection requires computing the 
volume of the estimated set, which suffers from the curse 
of dimensionality. This issue is still an open research area. 
Sampling more precisely in the region where the data live 
instead of sampling in the hypercube enclosing the data is a 
possible approach to scale to higher dimensions. 

Fig. 14. MV sets with mass at least 0.90 and 0.95 respectively in blue and in 
red estimated from the two features, average number of rooms per dwelling 
(x axis) and percentage lower status of the population (y axis), of the Boston 
housing data set. The features have been standardized. 